ABSTRACT

A dense linear system solver package recently developed at the University of Texas at Austin for
distributed-memory machines (e.g., the Intel Paragon) has been reviewed and analyzed. The package
contains about 45 software routines, some written in FORTRAN and some in C, and
forms the basis for parallel/distributed solutions of the systems of linear equations encountered in
many scientific and engineering problems. The package, which is being studied by the
Computer Applications Branch of the Analysis and Computation Division, may provide a
significant computational resource for NASA scientists and engineers in parallel/distributed
computing.

Because the package is new and not yet well tested or documented, many of its underlying concepts
and implementations were unclear. Our task was to review, analyze, and critique the package as a step
in the process that will enable scientists and engineers to apply it to the solution of their problems.

All routines in the package were reviewed and analyzed. The underlying theory and concepts, which
exist in the form of published papers, technical reports, or memos, were obtained either from
the author or from the scientific literature, and general algorithms, explanations, examples,
and critiques have been provided to explain the workings of these programs. Wherever
points remained unclear, the developer (author) was contacted, either by
telephone or by electronic mail, to clarify the workings of the routines. Whenever possible,
tests were made to verify the concepts and logic employed in their implementations. A detailed
report explaining the workings of these routines is being documented separately.


INTRODUCTION 

The solution of linear systems of equations is needed in many science and engineering
applications, including structural mechanics, fluid dynamics, chemical reactions, heat
transfer, weather prediction, and climate modeling. Many physical phenomena that are modeled by
systems of ordinary or partial differential equations eventually lead, when discretized, to the
solution of a linear system of equations.

For many years, computer codes have been developed to solve these systems of equations and
made available in the form of standard software packages such as LINPACK and LAPACK. Both
of these packages are based on BLAS1, BLAS2, and BLAS3 (Basic Linear Algebra Subprograms),
which mathematicians developed over 15 years to do matrix computations [13, 20, 32, 35, 36,
38]. Whereas LINPACK [14, 15, 37, 38, 42] provides codes for solving these systems of
equations on uniprocessor machines, LAPACK [3, 7, 8, 9, 14, 15, 23, 32, 34, 37] contains
software routines to solve them on shared-memory multiprocessors and supercomputers.
Attempts have recently been made [2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 15, 16, 18, 19, 21, 22, 23, 24, 27,
28, 31, 33, 34, 40, 44] to develop software packages to solve these systems on distributed-memory
machines such as Intel’s hypercubes and on networks of machines. The anticipation is that
scientific computation will increasingly be done on distributed-memory machines and/or on
networks of smaller machines; hence this thrust in the development of codes for these
architectures.

Although one can program distributed memory machines to solve scientific problems, the state of 
the art has not developed to the extent that the programmer does not have to be concerned about 
the machine architecture and its implications for algorithm selection and implementation. To 
advance the state of this art, as well as to apply the existing art to various scientific and 
engineering applications, the Langley Research Center, as part of the High Performance 
Computing and Communication Program, acquired an experimental machine called the Paragon.
The machine was developed by Intel Corporation of Oregon and is designed to interconnect hundreds
of processors in a rectangular mesh network. Each of the processors has its own memory (32 MB),
and the system can solve many applications simultaneously and cooperatively, each on its own
subset of the total number of processors (called a partition) [41, 43].

The machine, along with its other software, came with a software package called MPLINPACK,
or Multiple Processors Linear Equations Solver Package. This software has proven to be very
efficient on the Paragon. Our objective was to review and analyze the software modules to better
understand the underlying theory and implementation features that are responsible for its
efficiency.

There are about 45 modules, written in both FORTRAN and C, which can be
categorized into five classes: Data Distribution Routines, Data Communication Routines,
Computation Routines, Global Combination Routines, and Auxiliary Routines. These were
reviewed and analyzed. Sample tests were run to validate and further understand the underlying
theory and concepts for some of the modules.

In certain modules, the concepts were so new that the information was available only
through private technical reports or memos from the author (Professor Robert van de Geijn at the
University of Texas at Austin). In particular, the TREE and CTREE concepts in information
broadcasting [17, 25, 41], Efficient Global Combine Operations [18, 19], and the implementation
of FORCE-type messages [26, 43] in certain communication routines were researched and
discussed with the developer, with researchers in CAB, and with Intel Corporation
scientists in order to understand and analyze these routines. A detailed technical report is being
published and will be made available to the staff of CAB and other interested researchers.

An interesting problem came up when acceptance tests, consisting of a large system of equations
(of the order of 9000 x 9000), were being conducted by Mr. Bulle of CAB on a rectangular grid of
nodes of the Paragon. It was discovered that the performance of the system was much
better on an elongated grid (2 x 32) than on its transpose (32 x 2). After careful
analysis and discussions with the author (developer), we learned that the software employs a
Gaussian elimination (row) procedure and uses a different communication technique
in one direction of the grid (across the columns) than in the other direction (down the columns).
That is, the information (pivot row indices, column and row factored panels) broadcast across the
column nodes is sent in pipelined (ring) fashion, whereas the information (pivot element and
pivot row indices) broadcast down the column nodes uses the CTREE technique. Since the
CTREE broadcast is slower than pipelining, a grid with fewer node rows
gives better performance.
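
The effect of grid shape can be illustrated with a toy cost model. The sketch below is not taken
from MPLINPACK. It assumes that the CTREE broadcast behaves like a generic tree broadcast needing
ceil(log2 p) sequential full-message sends among p nodes, that the pipelined broadcast splits the
message into packets passed along a ring, and that both directions carry messages of the same size.
The function names, the latency (alpha) and per-byte (beta) costs, and the packet count are all
hypothetical values chosen for illustration, not measured Paragon figures.

```python
def tree_broadcast_time(p, nbytes, alpha, beta):
    """Assumed tree broadcast among p nodes: ceil(log2 p) full-message sends in sequence."""
    if p <= 1:
        return 0.0
    steps = (p - 1).bit_length()  # equals ceil(log2(p)) for p >= 2
    return steps * (alpha + nbytes * beta)


def ring_broadcast_time(p, nbytes, alpha, beta, packets):
    """Assumed pipelined ring broadcast: the message travels as `packets` equal pieces."""
    if p <= 1:
        return 0.0
    per_packet = alpha + (nbytes / packets) * beta
    # The first packet needs p - 1 hops to reach the last node;
    # each of the remaining packets arrives one step later.
    return (p - 1 + packets - 1) * per_packet


def grid_broadcast_time(rows, cols, nbytes, alpha=1e-4, beta=1e-8, packets=64):
    """Cost of one tree broadcast down a column of `rows` nodes plus
    one pipelined broadcast across a row of `cols` nodes (hypothetical parameters)."""
    down_columns = tree_broadcast_time(rows, nbytes, alpha, beta)
    across_columns = ring_broadcast_time(cols, nbytes, alpha, beta, packets)
    return down_columns + across_columns


if __name__ == "__main__":
    message_bytes = 1_000_000  # hypothetical message size
    for rows, cols in [(2, 32), (32, 2)]:
        t = grid_broadcast_time(rows, cols, message_bytes)
        print(f"{rows} x {cols} grid: modeled broadcast time {t:.4f} s")
```

Under these assumptions the 2 x 32 grid has the lower modeled cost, because the tree broadcast
spans only two node rows while the long direction is covered by the pipelined broadcast. The model
ignores computation, overlap of communication with computation, and the dependence of message
size on grid shape, so it indicates only the direction of the effect.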